MATH 318 Final Project: The Effect of World Cup on Stock Price and
Trading Activity
library(tidyverse)
library(dplyr)
library(MASS)
library(lubridate)
library(stringr)
library(GGally)
library(knitr)
library(kableExtra)
library(tidyquant)
Loading required package: PerformanceAnalytics
Loading required package: xts
Loading required package: zoo
Attaching package: ‘zoo’
The following objects are masked from ‘package:base’:
as.Date, as.Date.numeric
################################### WARNING ###################################
# We noticed you have dplyr installed. The dplyr lag() function breaks how #
# base R's lag() function is supposed to work, which breaks lag(my_xts). #
# #
# Calls to lag(my_xts) that you enter or source() into this session won't #
# work correctly. #
# #
# All package code is unaffected because it is protected by the R namespace #
# mechanism. #
# #
# Set `options(xts.warn_dplyr_breaks_lag = FALSE)` to suppress this warning. #
# #
# You can use stats::lag() to make sure you're not using dplyr::lag(), or you #
# can add conflictRules('dplyr', exclude = 'lag') to your .Rprofile to stop #
# dplyr from breaking base R's lag() function. #
################################### WARNING ###################################
Attaching package: ‘xts’
The following objects are masked from ‘package:dplyr’:
first, last
Attaching package: ‘PerformanceAnalytics’
The following object is masked from ‘package:graphics’:
legend
Loading required package: quantmod
Loading required package: TTR
Registered S3 method overwritten by 'quantmod':
method from
as.zoo.data.frame zoo
Introduction
Our goal in this project was to explore the effect the FIFA World Cup
has on the stock market. In researching to prepare, we found that the
World Cup has a generally negative impact on the market. According to an
article from MarketWatch, a country's average market return
significantly drops after their team is eliminated from the Cup. The
inverse can't be said for team wins, which leads to the overall market
decreasing. Our interest was to investigate the market trends,
specifically regarding the S&P 500, in relation to the times of
World Cup games.Â
Exploration
candleStick_plot<-function(symbol,from,to){
tq_get(symbol,
from = from,
to = to,
warnings = FALSE) %>% mutate(greenRed=ifelse(open-close>0,
"Red",
"Green")) %>%
ggplot()+
geom_segment(aes(x = date,
xend=date,
y =open,
yend =close,
colour=greenRed),
size=3)+
theme_tq()+
geom_segment(aes(x = date,
xend=date,
y =high,
yend =low,
colour=greenRed))+
scale_color_manual(values=c("Forest Green","Red"))+
ggtitle(paste0(symbol," (",from," - ",to,")"))+
theme(legend.position ="none",
axis.title.y = element_blank(),
axis.title.x=element_blank(),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
plot.title= element_text(hjust=0.5))
}
candleStick_plot("SPY", from = '2002-05-31', to = '2002-06-30')
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
Please use `linewidth` instead.

candleStick_plot("SPY", from = '2006-06-09', to = '2006-07-09')

candleStick_plot("SPY", from = '2010-06-11',to = '2010-07-11')

candleStick_plot("SPY", from = '2014-06-12',to = '2014-07-13')

Load Data
spydata = as_tibble(read.csv("1_min_SPY_2008-2021.csv"))
worldcupmatches = as_tibble(read.csv("WorldCupMatches.csv"))
We were interested in looking at the general price of the market
during the world cup, to confirm if our data shows the proposed trend of
decreased price during the World Cup we found in our research. To do
this we used a graphing method called candlesticks plots. These plots
show the average and variation of the price over a given month. Each
plot spanned the month of the World Cup for the years 2002, 2006, 2010,
2014. Both the graph for 2002 and 2010 display a generally decreasing
trend throughout the month. However, the 2006 and 2014 graphs instead
show a general increase in price over time. To better understand trends
in the data, we then looked closer at information from singular
games.Â
Helper Functions
Get Only Spy Data for a Specific Game
#Clean SpyData Per Game
getSpyDataWithinRangeofGame = function(spydata,game_date,range){
rangeInSeconds = 60 * 60 * range
return(filter(spydata, (date >= game_date - rangeInSeconds) & (date <= game_date + rangeInSeconds)))
}
Check if there is Spy data for a game
hasSpyDataWithinRangeOfGame = function(spydata, game_dates, range){
list = c()
for(game_date in 1:length(game_dates)){
list = append(list, nrow(getSpyDataWithinRangeofGame(spydata, game_date = game_dates[game_date], range)) != 0)
}
return(list)
}
Get Spy Data combined with Game Data for a set of Games
#Clean Spydata Per Worldcup Returns a set of spydata with their corresponding game data
getSpyAndGameDataWithinWorldcup = function(worldcup, spydata, range){
z = getSpyDataWithinRangeofGame(spydata, worldcup[[1,"Datetime"]], range)
gamerow = worldcup[1,]
for(colIndx in 1: ncol(gamerow)){
colvalue = worldcup[[1, colIndx]]
colname = colnames(worldcup)[colIndx]
z[colname] = rep(colvalue, times= nrow(z))
}
z["time.from.game"] = as.numeric(difftime(z$date, worldcup[[1,"Datetime"]],units = "secs"))
for(gameIndx in 2:nrow(worldcup)){
x = getSpyDataWithinRangeofGame(spydata, worldcup[[gameIndx,"Datetime"]], range)
gamerow = worldcup[gameIndx,]
for(colIndx in 1: ncol(gamerow)){
colvalue = worldcup[[gameIndx, colIndx]]
colname = colnames(worldcup)[colIndx]
x[colname] = rep(colvalue, times= nrow(x))
}
x["time.from.game"] = as.numeric(difftime(x$date, x$Datetime, units = "secs"))
z = union_all(z,x)
}
return(z)
}
Get Spy Data combined with Game Data for a single game
#Gives Spydata and Difference from the game time for each worldcup game
getSpyAndGameDataForOneGame = function(spydata,worldcup, game_index, range){
z = getSpyDataWithinRangeofGame(spydata, worldcup[[game_index,"Datetime"]], range)
for(colIndx in 1: ncol(worldcup[game_index,])){
colvalue = worldcup[[game_index, colIndx]]
colname = colnames(worldcup)[colIndx]
z[colname] = rep(colvalue, times= nrow(z))
}
z["time.from.game"] = as.numeric(difftime(z$date, rep(worldcup[[game_index,"Datetime"]], times = nrow(z)), units="secs"))
return(z)
}
Cleaning Data
#Update
spydata$date = as.POSIXct(spydata$date, format="%Y%m%d %H:%M:%S")
#Remove Rows containing NA's
cleaned_worldcupmatches = unique(worldcupmatches[!apply(is.na(worldcupmatches) | worldcupmatches == "", 1, all),])
#Convert Date and Time into POSIX EDT
cleaned_worldcupmatches$Datetime = as.POSIXct(cleaned_worldcupmatches$Datetime, format = "%e %b %Y - %R") - 60 * 60
#Filter Games on the Weekend
cleaned_worldcupmatches = filter(cleaned_worldcupmatches, wday(as.Date(Datetime)) != 7 & wday(as.Date(Datetime)) != 1)
#Filter Games that have no corresponding Data
cleaned_worldcupmatches = add_column(cleaned_worldcupmatches,"HasSpyData" = hasSpyDataWithinRangeOfGame(spydata, cleaned_worldcupmatches$Datetime, 3))
cleaned_worldcupmatches = filter(cleaned_worldcupmatches, cleaned_worldcupmatches$HasSpyData == TRUE)
cleaned_worldcupmatches
Get Spy Data
#Get Spy data within 3 hours of the Game For both World cups
allspydata2014 = getSpyAndGameDataWithinWorldcup(filter(cleaned_worldcupmatches, Year==2014), spydata, 3)
allspydata2010 = getSpyAndGameDataWithinWorldcup(filter(cleaned_worldcupmatches, Year==2010), spydata, 3)
Normalizing the average price
#Normalize the Data so that we can compare fairly the growth of stock price
max.2014 = max(allspydata2014$average)
max.2010 = max(allspydata2010$average)
min.2014 = min(allspydata2014$average)
min.2010 = min(allspydata2010$average)
difference.2014 = (max.2014) - (min.2014)
difference.2010 = (max.2010) - (min.2010)
normalized2014average = (allspydata2014$average-min.2014)/difference.2014
normalized2010average = (allspydata2010$average-min.2010)/difference.2010
allspydata2014 = add_column(allspydata2014, "normalized.average" = normalized2014average)
allspydata2010 = add_column(allspydata2010, "normalized.average" = normalized2010average)
allspydata = union_all(allspydata2014, allspydata2010)
allspydata
NA
Before we could perform any tests on the data sets, we first had to
take into account the stock price differences between the years 2010 and
2014. In order to normalize the stock price data, we had to split the
data into each year and normalize the prices within that year. The
method we chose to normalize the prices was subtracting the minimum
price for that year from each price. This value was then divided by the
difference between the maximum and minimum price for that year. Each
normalized price is a value between zero and one. When interpreting
these values, the closer the normalized value is to one, the closer the
price is to the maximum for that year. Similarly, the closer the value
is to zero, the closer the price is to the minimum for that year. By
normalizing the prices for each year, we created an even playing field
for price data, as they are all relative to the stock market climate in
their given year.
Exploring the Relationship of Time and Volume
Let us explore the Correlation between the Time form game and the
Price of the stock
Correlations
ggpairs(allspydata, columns = c("time.from.game", "average", "volume", "normalized.average"))

kable(cor(allspydata[, c(7,9,31,32)]))
|
|
volume
|
average
|
time.from.game
|
normalized.average
|
|
volume
|
1.0000000
|
-0.4039908
|
0.0334462
|
0.1264736
|
|
average
|
-0.4039908
|
1.0000000
|
0.2149680
|
-0.3961910
|
|
time.from.game
|
0.0334462
|
0.2149680
|
1.0000000
|
0.0158785
|
|
normalized.average
|
0.1264736
|
-0.3961910
|
0.0158785
|
1.0000000
|
After cleaning the data set, we computed the cross-correlations for
the variables: time from the game, volume of trades, average price, and
normalized average price. We were most interested in seeing how time
from game affected volume of trades since volume shows stock
productivity. We found that time from game had the highest correlation
with average price, the correlation was 0.215. We also found that the
correlation between time from game and volume was .033. These low values
make sense given the data we had access to. We did not have data for
commonly traded stocks by day traders. Nevertheless, we wanted to
regress with volume as the dependent variable and time from game as the
independent variable. After attempting multiple regression types,
nonlinear regression gave us the best fit.Â
Linear Regression
ggplot(allspydata) + geom_point(aes(time.from.game,normalized.average))

ggplot(allspydata, aes(time.from.game,log(volume))) + geom_point() + geom_smooth() + geom_vline(xintercept= 0, color = "red") + geom_vline(xintercept = 5400, color ="red")

The model we used for nonlinear regression is a Generalized Additive
Model, also known as a GAM. A GAM is made up of splines, splines are
smooth polynomial functions that cover a small range. These splines
replace the beta coefficients that we normally see in linear regression,
so that we can regress on nonlinear data. When we graph our nonlinear
model over the volume data, we see that there is a spike in volume of
trades 5000 seconds before the time of the game, which is about an hour
and a half before the game. After this spike, there is a gradual
increase once the game starts and the positive trend continues as the
game finishes.
Sample Games from a World Cup
set.seed(100)
samplesize = 10
worldcupmatches2014 = filter(cleaned_worldcupmatches, Year == 2014)
sampleworldcupgames = sample_n(worldcupmatches2014,size = samplesize)
game1 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 1, 3)
volumesp1 <- ggplot(game1, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[1]) + ggtitle("Volume over Time, Sample Game 1")
volumesp1

game2 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 2, 3)
volumesp2 <- ggplot(game2, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[2])+ ggtitle("Volume over Time, Sample Game 2")
volumesp2

#ERROR AGAIN WE DONT HAVE FULL DATA FOR THIS GAME
game3 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 3, 3)
volumesp3 <- ggplot(game3, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[3])+ ggtitle("Volume over Time, Sample Game 3")
volumesp3

game4 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 4, 3)
volumesp4 <- ggplot(game4, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[4])+ ggtitle("Volume over Time, Sample Game 4")
volumesp4

game5 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 5, 3)
volumesp5 <- ggplot(game5, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[5])+ ggtitle("Volume over Time, Sample Game 5")
volumesp5

Price Scatter Plots
Using average price
pricesp1 <- ggplot(game1) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[1])+ ggtitle("Price over Time, Sample Game 1")
pricesp1

pricesp2 <- ggplot(game2) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[2])+ ggtitle("Price over Time, Sample Game 2")
pricesp2

#ISSUE BECAUSE WE DONT HAVE DATA FOR 16:00 and that is time of the game
pricesp3 <- ggplot(game3) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[3])+ ggtitle("Price over Time, Sample Game 3")
pricesp3

pricesp4 <- ggplot(game4) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[4])+ ggtitle("Price over Time, Sample Game 4")
pricesp4

pricesp5 <- ggplot(game5, aes(date)) + geom_point(aes(y=average, ), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[5])+ ggtitle("Price over Time, Sample Game 5")
pricesp5

We next looked at the volume of trading over a range of time from
five sample games. Using ggpoint, we plotted the scatter plot of the
market volume over a period of a few hours before and after the start of
the game, and we indicated the start time with a vertical line on the
graph. The values for volume are very large and varied, which makes the
graphs difficult to read. To make them more digestible, we instead used
the log of volume as the dependent variable, giving us smaller numbers.
We found that, in general, there is a slight dip in the volume around
the start time of the game. This agrees with findings from our research
that stock market activity drops durring the World Cup as more attention
is directed towards the games. We also used scatter plots to look at the
average price of stocks over a range of time around the start time of
sample games. As for the volume plots, we indicated the start time of
the game with a vertical line on the graph. The figures show that the
average price tends to drop just before the start of the game, and then
increase. These trends in price lead us to explore the times it would be
best to both buy and sell stocks.
Just USA Games
worldcupmatchesUSAhome = filter(cleaned_worldcupmatches, Home.Team.Name == "USA")
worldcupmatchesUSAaway = filter(cleaned_worldcupmatches, Away.Team.Name == "USA")
worldcupmatchesUSA <- union_all(worldcupmatchesUSAaway,worldcupmatchesUSAhome)
worldcupmatchesUSA
game1USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 1, 3)
volumesp1USA <- ggplot(game1USA) + geom_point(mapping = aes(x=date, y=log(volume)), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[1])+ ggtitle("Volume over Time, USA Game 1, 2010")
volumesp1USA

game2USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 2, 3)
volumesp2USA <- ggplot(game2USA) + geom_point(mapping = aes(x=date, y=log(volume)), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[2])+ ggtitle("Volume over Time, USA Game 2, 2014")
volumesp2USA

game3USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 3, 3)
volumesp3USA <- ggplot(game3USA) + geom_point(mapping = aes(x=date, y=volume), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[3])+ ggtitle("Volume over Time, USA Game 2, 2010")
volumesp3USA

game4USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 4, 3)
volumesp4USA <- ggplot(game4USA) + geom_point(mapping = aes(x=date, y=volume), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[4])+ ggtitle("Volume over Time, USA Game 1, 2014")
volumesp4USA

pricesp4USA <- ggplot(game4USA) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[4])+ ggtitle("Price over Time, USA Game 1, 2014")
pricesp4USA

Is there a way to make money during World Cup Games?
What time should you look to sell?
price_list = c()
time_list = c()
for (game in 1:nrow(cleaned_worldcupmatches)) {
price_game_data <- getSpyAndGameDataForOneGame(spydata, cleaned_worldcupmatches, game, 3)
highestdatapoint = price_game_data[which.max(price_game_data$average),]
highestpriceforgame <- highestdatapoint$average
time <- highestdatapoint$"time.from.game"
price_list = append(price_list, highestpriceforgame)
time_list = append(time_list, time)
}
optimal_selling.df <- data.frame("price" = price_list, "time" = time_list)
optimal_selling.df
Do THis for Optimal Buying time?
ggplot(optimal_selling.df) + geom_histogram(mapping = aes(x= time)) + ggtitle("How Often Best Selling Time is vs. Time From Game")

What time should you look to buy?
price_list = c()
time_list = c()
for (game in 1:nrow(cleaned_worldcupmatches)) {
price_game_data <- getSpyAndGameDataForOneGame(spydata, cleaned_worldcupmatches, game, 3)
lowestdatapoint = price_game_data[which.min(price_game_data$average),]
lowestpriceforgame <- lowestdatapoint$average
time <- lowestdatapoint$"time.from.game"
price_list = append(price_list, lowestpriceforgame)
time_list = append(time_list, time)
}
optimal_buying.df <- data.frame("price" = price_list, "time" = time_list)
ggplot(optimal_buying.df) + geom_histogram(mapping = aes(x= time)) + ggtitle("How Often Best Buying Time is vs. Time From Game")

Another idea we wanted to explore was the optimal buying and selling
times for SPY. The optimal buying time would be when the SPY price is
the lowest and the optimal selling time would be when the SPY price is
the highest. In order to find the optimal times from the game, we found
the maximum and minimum price of SPY during each game. From there, we
found the corresponding time from game for these maximum and minimum
price values. With this data, we created two histograms: one that
demonstrates the optimal buying times from a world cup game and another
that demonstrated the optimal selling times from a world cup game. The
histograms display the time from game on the x-axis and show the amount
of hames that have their optimal selling or buying time at that time
value. For the optimal selling histogram, there are peaks at 10,000
seconds and 5,000 seconds before the game. These two time values
correspond to around three hours before the game and one and a half
hours before the game, respectively. There were other outliers in the
histogram, but these two values seem to be the best times around a world
cup game to sell SPY stock, based on our data. For the optimal buying
histogram, there was a large peak around 10,000 seconds before the game,
which is three hours before the game. The histogram is skewed right, so
the times where the SPY price was the lowest was before the game. From
this histogram, it appears that the best time to buy SPY stock is three
hours before the start of a world cup game.Â
Does the importance of the game matter in the total amount of
trades?
allspydata$Stage[grepl("Group", allspydata$Stage, fixed = TRUE)] = "Group"
ggplot(allspydata, aes(log(volume), Stage, fill = Stage)) + geom_boxplot() + ggtitle("Log(Volume) vs. Stage Of Game")

allspydata$Stage[grepl("Group", allspydata$Stage, fixed = TRUE)] = "Group"
ggplot(allspydata, aes(log(volume), Stage, fill = Stage)) + geom_boxplot() + ggtitle("Log(Volume) vs. Stage Of Game")
---
title: "MATH 318 Final Project"
output: html_notebook
---

# MATH 318 Final Project: The Effect of World Cup on Stock Price and Trading Activity

```{r}
library(tidyverse)
library(dplyr)
library(MASS)
library(lubridate)
library(stringr)
library(GGally)
library(knitr)
library(kableExtra)
library(tidyquant)
```

# Introduction

Our goal in this project was to explore the effect the FIFA World Cup has on the stock market. In researching to prepare, we found that the World Cup has a generally negative impact on the market. According to an article from MarketWatch, a country\'s average market return significantly drops after their team is eliminated from the Cup. The inverse can\'t be said for team wins, which leads to the overall market decreasing. Our interest was to investigate the market trends, specifically regarding the S&P 500, in relation to the times of World Cup games. 

## Exploration

```{r}
candleStick_plot<-function(symbol,from,to){
  tq_get(symbol,
        from = from,
        to = to,
        warnings = FALSE) %>% mutate(greenRed=ifelse(open-close>0,
                           "Red",
                           "Green")) %>%
    ggplot()+
    geom_segment(aes(x = date,
                     xend=date,
                     y =open,
                     yend =close,
                     colour=greenRed),
                 size=3)+
    theme_tq()+
    geom_segment(aes(x = date,
                     xend=date,
                     y =high,
                     yend =low,
                     colour=greenRed))+
    scale_color_manual(values=c("Forest Green","Red"))+
    ggtitle(paste0(symbol," (",from," - ",to,")"))+
    theme(legend.position ="none",
          axis.title.y = element_blank(),
          axis.title.x=element_blank(),
          axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          plot.title= element_text(hjust=0.5))
}

```

```{r}
candleStick_plot("SPY", from = '2002-05-31', to = '2002-06-30')
candleStick_plot("SPY", from = '2006-06-09', to = '2006-07-09')
candleStick_plot("SPY", from = '2010-06-11',to = '2010-07-11')
candleStick_plot("SPY", from = '2014-06-12',to = '2014-07-13')
```

# Load Data

```{r}
spydata = as_tibble(read.csv("1_min_SPY_2008-2021.csv"))
```

```{r}
worldcupmatches = as_tibble(read.csv("WorldCupMatches.csv"))
```

We were interested in looking at the general price of the market during the world cup, to confirm if our data shows the proposed trend of decreased price during the World Cup  we found in our research. To do this we used a graphing method called candlesticks plots. These plots show the average and variation of the price over a given month. Each plot spanned the month of the World Cup for the years 2002, 2006, 2010, 2014. Both the graph for 2002 and 2010 display a generally decreasing trend throughout the month. However, the 2006 and 2014 graphs instead show a general increase in price over time. To better understand trends in the data, we then looked closer at information from singular games. 

\
\

# Helper Functions

## Get Only Spy Data for a Specific Game

```{r}
#Clean SpyData Per Game
getSpyDataWithinRangeofGame = function(spydata,game_date,range){
  rangeInSeconds = 60 * 60 * range
  return(filter(spydata, (date >= game_date - rangeInSeconds) & (date <= game_date + rangeInSeconds)))
}
```

## Check if there is Spy data for a game

```{r}
hasSpyDataWithinRangeOfGame = function(spydata, game_dates, range){
  list = c()
  for(game_date in 1:length(game_dates)){
    list = append(list, nrow(getSpyDataWithinRangeofGame(spydata, game_date = game_dates[game_date], range)) != 0)
  }
  return(list)
}
```

## Get Spy Data combined with Game Data for a set of Games

```{r}
#Clean Spydata Per Worldcup Returns a set of spydata with their corresponding game data
getSpyAndGameDataWithinWorldcup = function(worldcup, spydata, range){
  z = getSpyDataWithinRangeofGame(spydata, worldcup[[1,"Datetime"]], range)
  gamerow = worldcup[1,]
  for(colIndx in 1: ncol(gamerow)){
      colvalue = worldcup[[1, colIndx]]
      colname = colnames(worldcup)[colIndx]
      z[colname] = rep(colvalue, times= nrow(z))
  }
  z["time.from.game"] = as.numeric(difftime(z$date, worldcup[[1,"Datetime"]],units = "secs"))
  for(gameIndx in 2:nrow(worldcup)){
   x = getSpyDataWithinRangeofGame(spydata, worldcup[[gameIndx,"Datetime"]], range)
   gamerow = worldcup[gameIndx,]
   for(colIndx in 1: ncol(gamerow)){
      colvalue = worldcup[[gameIndx, colIndx]]
      colname = colnames(worldcup)[colIndx]
      x[colname] = rep(colvalue, times= nrow(x))
   }
   x["time.from.game"] = as.numeric(difftime(x$date, x$Datetime, units = "secs"))
  
   z = union_all(z,x)
  }
  return(z)
}
```

## Get Spy Data combined with Game Data for a single game

```{r}
#Gives Spydata and Difference from the game time for each worldcup game
getSpyAndGameDataForOneGame = function(spydata,worldcup, game_index, range){
  z = getSpyDataWithinRangeofGame(spydata, worldcup[[game_index,"Datetime"]], range)
  for(colIndx in 1: ncol(worldcup[game_index,])){
      colvalue = worldcup[[game_index, colIndx]]
      colname = colnames(worldcup)[colIndx]
      z[colname] = rep(colvalue, times= nrow(z))
  }
  z["time.from.game"] = as.numeric(difftime(z$date, rep(worldcup[[game_index,"Datetime"]], times = nrow(z)), units="secs"))
  return(z)
}
```

# Cleaning Data

```{r}
#Update 
spydata$date = as.POSIXct(spydata$date, format="%Y%m%d %H:%M:%S")

#Remove Rows containing NA's
cleaned_worldcupmatches = unique(worldcupmatches[!apply(is.na(worldcupmatches) | worldcupmatches == "", 1, all),])

#Convert Date and Time into POSIX EDT 
cleaned_worldcupmatches$Datetime = as.POSIXct(cleaned_worldcupmatches$Datetime, format = "%e %b %Y - %R") - 60 * 60

#Filter Games on the Weekend
cleaned_worldcupmatches = filter(cleaned_worldcupmatches, wday(as.Date(Datetime)) != 7 & wday(as.Date(Datetime)) != 1)

#Filter Games that have no corresponding Data
cleaned_worldcupmatches = add_column(cleaned_worldcupmatches,"HasSpyData" = hasSpyDataWithinRangeOfGame(spydata, cleaned_worldcupmatches$Datetime, 3))
cleaned_worldcupmatches = filter(cleaned_worldcupmatches, cleaned_worldcupmatches$HasSpyData == TRUE)

cleaned_worldcupmatches
```

## Get Spy Data

```{r}
#Get Spy data within 3 hours of the Game For both World cups
allspydata2014 = getSpyAndGameDataWithinWorldcup(filter(cleaned_worldcupmatches, Year==2014), spydata, 3)

allspydata2010 = getSpyAndGameDataWithinWorldcup(filter(cleaned_worldcupmatches, Year==2010), spydata, 3)

```

## Normalizing the average price

```{r}
#Normalize the Data so that we can compare fairly the growth of stock price
max.2014 = max(allspydata2014$average)
max.2010 = max(allspydata2010$average)


min.2014 = min(allspydata2014$average)
min.2010 = min(allspydata2010$average)

difference.2014 = (max.2014) - (min.2014)
difference.2010 = (max.2010) - (min.2010)
normalized2014average = (allspydata2014$average-min.2014)/difference.2014
normalized2010average = (allspydata2010$average-min.2010)/difference.2010


allspydata2014 = add_column(allspydata2014, "normalized.average" = normalized2014average)
allspydata2010 = add_column(allspydata2010, "normalized.average" = normalized2010average)

allspydata = union_all(allspydata2014, allspydata2010)

allspydata
```

Before we could perform any tests on the data sets, we first had to take into account the stock price differences between the years 2010 and 2014. In order to normalize the stock price data, we had to split the data into each year and normalize the prices within that year. The method we chose to normalize the prices was subtracting the minimum price for that year from each price. This value was then divided by the difference between the maximum and minimum price for that year. Each normalized price is a value between zero and one. When interpreting these values, the closer the normalized value is to one, the closer the price is to the maximum for that year. Similarly, the closer the value is to zero, the closer the price is to the minimum for that year. By normalizing the prices for each year, we created an even playing field for price data, as they are all relative to the stock market climate in their given year.

# Exploring the Relationship of Time and Volume

Let us explore the Correlation between the Time form game and the Price of the stock

## Correlations

```{r}
ggpairs(allspydata, columns = c("time.from.game", "average", "volume", "normalized.average"))
```

```{r}
kable(cor(allspydata[, c(7,9,31,32)]))
```

After cleaning the data set, we computed the cross-correlations for the variables: time from the game, volume of trades, average price, and normalized average price. We were most interested in seeing how time from game affected volume of trades since volume shows stock productivity. We found that time from game had the highest correlation with average price, the correlation was 0.215. We also found that the correlation between time from game and volume was .033. These low values make sense given the data we had access to. We did not have data for commonly traded stocks by day traders. Nevertheless, we wanted to regress with volume as the dependent variable and time from game as the independent variable. After attempting multiple regression types, nonlinear regression gave us the best fit. 

## Linear Regression

```{r}
ggplot(allspydata) + geom_point(aes(time.from.game,normalized.average))
```

```{r}
ggplot(allspydata, aes(time.from.game,log(volume))) + geom_point() + geom_smooth() + geom_vline(xintercept= 0, color = "red") + geom_vline(xintercept = 5400, color ="red")

```

The model we used for nonlinear regression is a Generalized Additive Model, also known as a GAM. A GAM is made up of splines, splines are smooth polynomial functions that cover a small range. These splines replace the beta coefficients that we normally see in linear regression, so that we can regress on nonlinear data. When we graph our nonlinear model over the volume data, we see that there is a spike in volume of trades 5000 seconds before the time of the game, which is about an hour and a half before the game. After this spike, there is a gradual increase once the game starts and the positive trend continues as the game finishes.

\
\

# Sample Games from a World Cup

```{r}
set.seed(100) 
samplesize = 10
worldcupmatches2014 = filter(cleaned_worldcupmatches, Year == 2014)
sampleworldcupgames = sample_n(worldcupmatches2014,size = samplesize)
```

```{r}
game1 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 1, 3)
volumesp1 <- ggplot(game1, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[1]) + ggtitle("Volume over Time, Sample Game 1")
volumesp1
```

```{r}
game2 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 2, 3)
volumesp2 <- ggplot(game2, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[2])+ ggtitle("Volume over Time, Sample Game 2")
volumesp2
#ERROR AGAIN WE DONT HAVE FULL DATA FOR THIS GAME
```

```{r}
game3 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 3, 3)
volumesp3 <- ggplot(game3, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[3])+ ggtitle("Volume over Time, Sample Game 3")
volumesp3
```

```{r}
game4 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 4, 3)
volumesp4 <- ggplot(game4, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[4])+ ggtitle("Volume over Time, Sample Game 4")
volumesp4
```

```{r}
game5 <- getSpyAndGameDataForOneGame(spydata, sampleworldcupgames, 5, 3)
volumesp5 <- ggplot(game5, mapping = aes(x=date, y=log(volume))) + geom_point(aes(color =average)) + geom_vline(xintercept= sampleworldcupgames$Datetime[5])+ ggtitle("Volume over Time, Sample Game 5")
volumesp5
```

## Price Scatter Plots

Using average price

```{r}
pricesp1 <- ggplot(game1) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[1])+ ggtitle("Price over Time, Sample Game 1")
pricesp1
```

```{r}
pricesp2 <- ggplot(game2) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[2])+ ggtitle("Price over Time, Sample Game 2")
pricesp2
#ISSUE BECAUSE WE DONT HAVE DATA FOR 16:00 and that is time of the game
```

```{r}
pricesp3 <- ggplot(game3) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[3])+ ggtitle("Price over Time, Sample Game 3")
pricesp3
```

```{r}
pricesp4 <- ggplot(game4) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[4])+ ggtitle("Price over Time, Sample Game 4")
pricesp4
```

```{r}
pricesp5 <- ggplot(game5, aes(date)) + geom_point(aes(y=average, ), color ="red") + geom_vline(xintercept= sampleworldcupgames$Datetime[5])+ ggtitle("Price over Time, Sample Game 5")
pricesp5
```

We next looked at the volume of trading over a range of time from five sample games. Using ggpoint, we plotted the scatter plot of the market volume over a period of a few hours before and after the start of the game, and we indicated the start time with a vertical line on the graph. The values for volume are very large and varied, which makes the graphs difficult to read. To make them more digestible, we instead used the log of volume as the dependent variable, giving us smaller numbers. We found that, in general, there is a slight dip in the volume around the start time of the game. This agrees with findings from our research that stock market activity drops durring the World Cup as more attention is directed towards the games. We also used scatter plots to look at the average price of stocks over a range of time around the start time of sample games. As for the volume plots, we indicated the start time of the game with a vertical line on the graph. The figures show that the average price tends to drop just before the start of the game, and then increase. These trends in price lead us to explore the times it would be best to both buy and sell stocks.

\
\

## Just USA Games

```{r}
worldcupmatchesUSAhome = filter(cleaned_worldcupmatches, Home.Team.Name == "USA")
worldcupmatchesUSAaway = filter(cleaned_worldcupmatches, Away.Team.Name == "USA")
worldcupmatchesUSA <- union_all(worldcupmatchesUSAaway,worldcupmatchesUSAhome)
worldcupmatchesUSA
```

```{r}
game1USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 1, 3)
volumesp1USA <- ggplot(game1USA) + geom_point(mapping = aes(x=date, y=log(volume)), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[1])+ ggtitle("Volume over Time, USA Game 1, 2010")
volumesp1USA
```

```{r}
game2USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 2, 3)
volumesp2USA <- ggplot(game2USA) + geom_point(mapping = aes(x=date, y=log(volume)), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[2])+ ggtitle("Volume over Time, USA Game 2, 2014")
volumesp2USA
```

```{r}
game3USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 3, 3)
volumesp3USA <- ggplot(game3USA) + geom_point(mapping = aes(x=date, y=volume), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[3])+ ggtitle("Volume over Time, USA Game 2, 2010")
volumesp3USA
```

```{r}
game4USA <- getSpyAndGameDataForOneGame(spydata, worldcupmatchesUSA, 4, 3)
volumesp4USA <- ggplot(game4USA) + geom_point(mapping = aes(x=date, y=volume), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[4])+ ggtitle("Volume over Time, USA Game 1, 2014")
volumesp4USA
```

```{r}
pricesp4USA <- ggplot(game4USA) + geom_point(mapping = aes(x=date, y=average), color ="red") + geom_vline(xintercept= worldcupmatchesUSA$Datetime[4])+ ggtitle("Price over Time, USA Game 1, 2014")
pricesp4USA
```

# Is there a way to make money during World Cup Games?

# What time should you look to sell?

```{r}
price_list = c()
time_list = c()
for (game in 1:nrow(cleaned_worldcupmatches)) {
  price_game_data <- getSpyAndGameDataForOneGame(spydata, cleaned_worldcupmatches, game, 3)
  highestdatapoint = price_game_data[which.max(price_game_data$average),]
  highestpriceforgame <- highestdatapoint$average
  time <- highestdatapoint$"time.from.game"
  price_list = append(price_list, highestpriceforgame)
  time_list = append(time_list, time)
}
```

```{r}
optimal_selling.df <- data.frame("price" = price_list, "time" = time_list)
optimal_selling.df
```

Do THis for Optimal Buying time?

```{r}
ggplot(optimal_selling.df) + geom_histogram(mapping = aes(x= time)) + ggtitle("How Often Best Selling Time is vs. Time From Game")
```

# What time should you look to buy?

```{r}
price_list = c()
time_list = c()
for (game in 1:nrow(cleaned_worldcupmatches)) {
  price_game_data <- getSpyAndGameDataForOneGame(spydata, cleaned_worldcupmatches, game, 3)
  lowestdatapoint = price_game_data[which.min(price_game_data$average),]
  lowestpriceforgame <- lowestdatapoint$average
  time <- lowestdatapoint$"time.from.game"
  price_list = append(price_list, lowestpriceforgame)
  time_list = append(time_list, time)
}
```

```{r}
optimal_buying.df <- data.frame("price" = price_list, "time" = time_list)
ggplot(optimal_buying.df) + geom_histogram(mapping = aes(x= time)) + ggtitle("How Often Best Buying Time is vs. Time From Game")
```

Another idea we wanted to explore was the optimal buying and selling times for SPY. The optimal buying time would be when the SPY price is the lowest and the optimal selling time would be when the SPY price is the highest. In order to find the optimal times from the game, we found the maximum and minimum price of SPY during each game. From there, we found the corresponding time from game for these maximum and minimum price values. With this data, we created two histograms: one that demonstrates the optimal buying times from a world cup game and another that demonstrated the optimal selling times from a world cup game. The histograms display the time from game on the x-axis and show the amount of hames that have their optimal selling or buying time at that time value. For the optimal selling histogram, there are peaks at 10,000 seconds and 5,000 seconds before the game. These two time values correspond to around three hours before the game and one and a half hours before the game, respectively. There were other outliers in the histogram, but these two values seem to be the best times around a world cup game to sell SPY stock, based on our data. For the optimal buying histogram, there was a large peak around 10,000 seconds before the game, which is three hours before the game. The histogram is skewed right, so the times where the SPY price was the lowest was before the game. From this histogram, it appears that the best time to buy SPY stock is three hours before the start of a world cup game. 

\
\

# Does the importance of the game matter in the total amount of trades?

```{r}
#Box Plot for Trades
ggplot(allspydata, aes(log(volume), Stage, fill = Stage)) + geom_boxplot() + ggtitle("Log(Volume) vs. Stage Of Game and Groups")
```

```{r}
allspydata$Stage[grepl("Group", allspydata$Stage, fixed = TRUE)]  = "Group"
ggplot(allspydata, aes(log(volume), Stage, fill = Stage)) + geom_boxplot() + ggtitle("Log(Volume) vs. Stage Of Game")
```
